Lecture 03

Author

Bill Perry

Lecture 2: Review

  • We covered inductive vs deductive reasoning
  • How to begin to ask questions
  • Accuracy and precision
  • What the general types of data are
  • How to set up an R project in Rstudio
  • How to install and load libraries
  • How to read a file into R
  • How to make a graph

Our first graph

Lecture 3: How to deal with data wrangling

  • Data management overview
  • How to make a tidy spreadsheet
  • Metadata - why you really should use it
  • Data repositories
  • R in practice


Lecture 3: Data management overview

  • Data: the raw material of science
  • Wide variety of formats, sizes, complexity
  • Data management and curation are often underemphasized
  • Good data management: owe it to our funding agencies, colleagues, supervisors, and study systems

Lecture 3: Data gathering - managing

Step 1

  1. Decide on what type of data you are collecting
  2. Decide on a controlled vocabulary, e.g., odo_mgl, drp_ugl
  3. Decide how the data will flow: what has to happen to it at each step
  4. Organize your project
  5. Enter the data as soon as you can
    1. in a spreadsheet, saved as both Excel and csv
    2. you really need to be sure it is tidy
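
As a small illustration of what "tidy" means here, the sketch below reshapes a wide table (one column per site) into a long one with one column per variable. The site names and values are made up; only the variable name odo_mgl comes from the list above. It uses base R so it runs without extra packages; tidyr::pivot_longer() does the same job in the tidyverse.

```r
# Hypothetical wide table: one column per site (not tidy)
wide_df <- data.frame(
  date   = as.Date(c("2025-02-01", "2025-02-02")),
  lake_a = c(9.1, 8.7),
  lake_b = c(10.2, NA)
)

# Tidy version: one row per observation, one column per variable
site_cols <- c("lake_a", "lake_b")
tidy_df <- data.frame(
  date    = rep(wide_df$date, times = length(site_cols)),
  site    = rep(site_cols, each = nrow(wide_df)),
  odo_mgl = unlist(wide_df[site_cols], use.names = FALSE)
)

print(tidy_df)
```

Each row of tidy_df is now a single measurement at one site on one date, which is the shape most R plotting and modelling functions expect.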

Tidy data by Wickham

Step 2:

  1. Make a metadata sheet
    1. data about data
    2. descriptions, units, etc.
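
A metadata sheet can itself be a small tidy table with one row per variable. The sketch below is one possible layout, not a required standard: the data values are made up, the descriptions are placeholders for the data owner to fill in, and the units are only guessed from the name suffixes _mgl and _ugl.

```r
# Hypothetical data with controlled-vocabulary column names
water_df <- data.frame(
  site    = c("lake_a", "lake_b"),
  odo_mgl = c(9.1, 10.2),
  drp_ugl = c(3.4, 5.0)
)

# Metadata sheet: one row per variable in the data
# (descriptions are placeholders; units are assumed from the name suffixes)
meta_df <- data.frame(
  variable    = c("site", "odo_mgl", "drp_ugl"),
  description = c("sampling site code", "describe odo here", "describe drp here"),
  units       = c("none", "mg/L", "ug/L")
)

# Any column without a metadata entry shows up here
undocumented <- setdiff(names(water_df), meta_df$variable)
print(undocumented)
```

An empty result means every column in the data has a matching metadata row.
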
Practice Exercise 1: Can you do this for the pine data we have collected?

Let’s recreate the basic histogram of fish lengths from our last class. Use the sculpin_df data frame that’s already loaded.

# Write your code here to read in the file
# How do you examine the data? What ways can you think of? Let's try them!

Lecture 3: Data gathering - managing

Step 3

  1. Store your raw data and metadata
  2. Electronic dataframes should be stored in ≥3 copies:
    1. Your computer (onsite)
    2. External storage (onsite)
    3. Offsite storage (e.g., cloud storage)
  3. Have a regular backup strategy
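
A backup is only useful if the copy is intact. The sketch below copies a stand-in raw file and compares checksums. The file and folder names are hypothetical, and everything is written to a temporary directory; a real backup would go to an external or cloud drive.

```r
# Stand-in raw file and backup folder in a temporary directory
# (paths are hypothetical)
raw_file   <- file.path(tempdir(), "raw_data.csv")
backup_dir <- file.path(tempdir(), "backup")
write.csv(data.frame(id = 1:3, len_mm = c(42, 55, 61)), raw_file, row.names = FALSE)

dir.create(backup_dir, showWarnings = FALSE)
backup_file <- file.path(backup_dir, basename(raw_file))
file.copy(raw_file, backup_file, overwrite = TRUE)

# Matching checksums mean the copy is byte-for-byte identical
same_file <- unname(tools::md5sum(raw_file) == tools::md5sum(backup_file))
print(same_file)
```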

Lecture 3: Data gathering - managing

Step 4

  1. Graph your data and check outliers, errors, missing data
  2. You can code missing values as NA or leave the cell blank… opinions differ
    1. you can set which codes count as missing when you import data
# Specifying NA values explicitly
library(readr)  # provides read_csv()
data <- read_csv("your_file.csv",
    na = c("", "NA", "N/A", "missing", "null"))
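
To see what such an import setting does, here is a self-contained sketch using base R's read.csv(), whose na.strings argument plays the same role as na = in read_csv(). The file contents are made up and written to a temporary file.

```r
# Write a small stand-in file with several missing-value codes
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("site,len_mm",
             "a,42",
             "b,N/A",
             "c,missing",
             "d,"), tmp_file)

# Base R analogue of the na = argument in read_csv()
len_df <- read.csv(tmp_file, na.strings = c("", "NA", "N/A", "missing", "null"))

# Count missing values per column and summarize what is left
na_counts <- colSums(is.na(len_df))
print(na_counts)
summary(len_df$len_mm)
```

Because every missing-value code is turned into NA at import, len_mm is read as a number rather than as text.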

Practice Exercise 2: Let’s plot our data again

Let’s recreate the basic plots you might use to visualize the data and see what it looks like.

# Let's practice making a plot!
# In what ways do you want to see the data? Let's try them!

Lecture 3: Data gathering - managing

Step 5

Cleaning data

  1. Correct errors, fill missing data with “NA”, resolve outliers
  2. Save a clean data file as the master file; it is often good to make it read-only
  3. Add information in a notes column or text file about what was done and why.
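
The three cleaning steps might look like this in R. This is a minimal sketch with made-up values: the 200 mm cutoff is a hypothetical plausibility threshold, and Sys.chmod() has only a limited effect on Windows.

```r
# Hypothetical raw data with one obvious entry error
raw_df <- data.frame(id = 1:4, len_mm = c(42, 550, 61, NA))

# Clean a copy, never the raw data; record what was done and why
clean_df <- raw_df
clean_df$notes <- NA_character_
bad <- !is.na(clean_df$len_mm) & clean_df$len_mm > 200   # hypothetical threshold
clean_df$notes[bad]  <- "len_mm set to NA: value outside plausible range"
clean_df$len_mm[bad] <- NA

# Save the master file under a unique temporary name and make it read-only
clean_file <- tempfile("cleaned_raw_data_", fileext = ".csv")
write.csv(clean_df, clean_file, row.names = FALSE)
Sys.chmod(clean_file, mode = "0444")
```

The raw data frame is left untouched, and the notes column travels with the cleaned file.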

Lecture 3: Data gathering - managing

Step 6

  1. Time to graph the data and explore, summarize, and transform data
  2. If cleaning, transformations, and calculations involve a lot of steps, save the results to a new output file.

A good way to organize script files is to number them in the order they get run.
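
A quick sketch of why numbered names help: with zero-padded prefixes, plain alphabetical sorting gives the run order. The script names and the scripts folder are hypothetical.

```r
# Hypothetical script names, numbered in run order
scripts <- c("03_analysis.R", "01_import.R", "02_clean.R")

# Zero-padded numbers sort correctly as plain text
run_order <- sort(scripts)
print(run_order)

# In a real project each script could then be run in turn:
# for (s in run_order) source(file.path("scripts", s))
```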

Lecture 3: Data gathering - managing

The important considerations in data

  1. Enter field/lab data into electronic format as soon as possible and back it up in at least one location (e.g., cloud storage)
  2. Do not modify raw data in any way after entry into electronic format
  3. Store all data in an open-access format (e.g., .csv)
  4. Thoroughly check and clean your raw data, saving the result as a separate file (e.g., “output/cleaned_raw_data.csv”)
  5. Accompany raw field/lab data with metadata that is unambiguously linked to the raw data file
  6. Carry out all analyses, calculations, and visualization on a file separate from the “raw” or “clean” data: the “analysis” data
  7. Perform all data transformation, analysis, and visualization with reproducible code, and store the code together with the data
  8. Arrange all raw and analysis data in “instance-row, variable-column” (tidy) format: one row per observation, one column per variable
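
One way to keep raw, clean, and analysis data apart is a fixed folder layout in which code writes every derived file. The sketch below builds such a layout in a temporary directory. The folder names data, output, and scripts and the file names other than cleaned_raw_data.csv are assumptions, not a prescribed structure.

```r
# Hypothetical project root inside a temporary directory
project_dir <- file.path(tempdir(), "lake_project")
for (f in c("data", "output", "scripts")) {
  dir.create(file.path(project_dir, f), recursive = TRUE, showWarnings = FALSE)
}

# Raw data goes in data/ and is never edited after entry
raw_file <- file.path(project_dir, "data", "raw_data.csv")
write.csv(data.frame(id = 1:3, len_mm = c(42, 55, NA)), raw_file, row.names = FALSE)

# The checked copy and the analysis data are written to separate files by code
clean_df <- read.csv(raw_file)
write.csv(clean_df, file.path(project_dir, "output", "cleaned_raw_data.csv"),
          row.names = FALSE)

analysis_df <- transform(clean_df, log_len = log(len_mm))
write.csv(analysis_df, file.path(project_dir, "output", "analysis_data.csv"),
          row.names = FALSE)

project_files <- list.files(project_dir, recursive = TRUE)
print(project_files)
```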

USE CONTROLLED VOCABULARY AND BE CONSISTENT

THINK BEFORE DOING -> WHAT HAPPENS DOWN THE ROAD

Lecture 3: Data gathering - managing

Broman KW, Woo KH. 2018. Data organization in spreadsheets. The American Statistician 72: 2-10.

  • Spreadsheets break data - use with extreme caution
  • Spreadsheets: data entry and storage
  • R: visualization and analysis
  • Goal: organize data so readable by humans and computers

Lecture 3: Data gathering - managing

Be consistent!

  • Codes for categorical variables
  • Variable names
    • use snake case and lower case, e.g., nitrate_n_mgl
    • always use the same name
  • Codes for missing values: NA, 9999, or a space (I know, but I do it)
  • Date formats
    • YYYY-MM-DD HH:MM:SS
    • Time begins at 1970-01-01: R counts dates and times from this origin
  • Names of objects
    • dataframes after import - data_df
    • plots - len_wt_plot
    • models - anova_wt_model
  • File names
    • use separators - 2025_02_01_lake_x_inflow.csv
  • Note: a consistent format requires considerable foresight and organization
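
Two of these conventions are easy to check in R. The sketch below shows that an ISO date parses without a format string and is stored as days since 1970-01-01, and that a numeric missing-value code such as 9999 has to be recoded to NA before any calculation. The nitrate values are made up.

```r
# ISO dates (YYYY-MM-DD) parse without a format string
sample_date <- as.Date("2025-02-01")

# R stores a Date as the number of days since 1970-01-01
as.numeric(as.Date("1970-01-01"))   # 0
days_since_epoch <- as.numeric(sample_date)

# Recode a numeric missing-value code (here 9999) to NA before analysis
nitrate_n_mgl <- c(0.12, 9999, 0.30)
nitrate_n_mgl[nitrate_n_mgl == 9999] <- NA
nitrate_mean <- mean(nitrate_n_mgl, na.rm = TRUE)
```

If the 9999 were left in place, it would be averaged in as a real measurement, which is the main risk of numeric missing-value codes.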

example of fish data

Practice Exercise 3: How do we fix variable names?

So there are two issues

  1. What can you do when reading in the file?
  2. What can you do once the file is in and you need to fix things?
# let's do #2 first - no pun intended
# if you wanted to rename variables, what would you do?

# now time for #1 - there are tools to make your life easier
# install.packages("janitor") # what does a janitor or BSW do?
# library(janitor)
# let's read in a messy file... junk.csv
# first look at the file
# df <- read_csv("data/junk.csv")
# df_excel <- read_excel("data/junk.xlsx")  # read_excel() comes from the readxl package
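
For fixing names once the file is in, here is one possible approach in base R. The messy column names are invented, and to_snake() is a small hypothetical helper written for this sketch; janitor::clean_names() is the packaged tool for the same job and handles many more cases.

```r
# Hypothetical messy column names, as they might arrive from a spreadsheet
junk_df <- data.frame(a = 1:2, b = c(3.5, 4.1), c = c(0.1, 0.2))
names(junk_df) <- c("Site ID", "Fish Length (mm)", "NO3.N mg/L")

# Option A: rename one column by hand
names(junk_df)[names(junk_df) == "Site ID"] <- "site_id"

# Option B: clean every name with a small helper (hypothetical, not janitor's code)
to_snake <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)   # runs of other characters become one underscore
  gsub("^_+|_+$", "", x)            # drop leading and trailing underscores
}
names(junk_df) <- to_snake(names(junk_df))
print(names(junk_df))
```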

Lecture 3: Data gathering - managing

Variable and file names can be a problem

  • Avoid spaces; use an underscore _ instead
  • Avoid special characters such as @#$%^@#
  • Use more than one kind of separator so you can split the name into parts later
    • or use the same number of characters across a variable name
    • 2025_03_04_file-site
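
Here is a sketch of how a name built this way can be taken apart later, using the example name from the list above. Treating the first three fields as year, month, and day is an assumption about what the name encodes.

```r
file_name <- "2025_03_04_file-site"   # example name from the list above

# Underscores separate the main fields
parts <- strsplit(file_name, "_", fixed = TRUE)[[1]]
file_date <- as.Date(paste(parts[1:3], collapse = "-"))

# The hyphen separates sub-fields inside the last field
sub_parts <- strsplit(parts[4], "-", fixed = TRUE)[[1]]

# Fixed-width alternative: the date always occupies the first 10 characters
date_chars <- substr(file_name, 1, 10)

print(file_date)
print(sub_parts)
```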

Lecture 3: Data gathering - managing

Excel will drive you mad

  • it will mess up your dates
  • store dates in separate columns - year, month, day
  • or use a string - 20250401
  • always use an unambiguous format, largest to smallest unit - why?
    • is 01 04 2025 the same as 04 01 2025?
    • what are the dates in English?
    • or European?
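
The ambiguity is easy to demonstrate: the same string gives two different dates depending on which convention the reader assumes. This sketch parses it both ways and then shows that largest-to-smallest strings sort correctly as plain text.

```r
ambiguous <- "01 04 2025"

# Day-month-year reading
as_dmy <- as.Date(ambiguous, format = "%d %m %Y")
# Month-day-year reading
as_mdy <- as.Date(ambiguous, format = "%m %d %Y")

print(as_dmy)   # 2025-04-01
print(as_mdy)   # 2025-01-04

# Largest-to-smallest order also sorts correctly as plain text
sorted_dates <- sort(c("2025-04-01", "2024-12-31", "2025-01-04"))
print(sorted_dates)
```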

Practice Exercise 4: Are there ways to deal with Excel, and why is it a problem?

Let’s look at junk.xlsx

# Write your code here to read in junk.xlsx and look at the date column
# copy the date to a new cell and make a number!
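
When a date cell is turned into a number, Excel shows a serial day count. A minimal sketch of converting such a number back in R, assuming the Windows Excel date system, for which R's as.Date() documentation gives the origin 1899-12-30. The serial number is an example value, not taken from junk.xlsx.

```r
# Example serial number, as Excel shows for a date cell formatted as a number
excel_serial <- 45748

# Origin for the Windows Excel date system (an assumption about the file)
excel_date <- as.Date(excel_serial, origin = "1899-12-30")
print(excel_date)

# Dates kept as separate year, month, day columns can be rebuilt with ISOdate()
rebuilt <- as.Date(ISOdate(2025, 4, 1))
print(rebuilt)
```

Files created with the older Mac date system use a different origin, so the origin should be checked against a date whose true value is known.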

Lecture 3: Data gathering - managing

Never do calculations in Excel

  • always do calculations in R - reproducible
  • never merge cells
  • can use highlighting but it will disappear
  • a nice rectangular dataframe will make you happy
    • tears will flow if not

Lecture 3: Data gathering - managing

Meta Data

  • This data will live beyond you
  • Someone will need to interpret it - what do they need?
    • What the data is about
    • Who collected it
    • When
    • Where
    • Funding agency
    • Methods used to collect
    • Variable names
      • description
      • units
      • abbreviations
    • CALCULATIONS AND WHY?

We need to know what happened and why, the units, and WTF it means.

TGW - yep, it’s a thing

ODO - what do you think it is?

NO3 - what is it? Are you sure? Why might you get in legal trouble if you used this?
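
One possible reading of the NO3 question: the same label can mean nitrate reported as the whole ion or as nitrogen only, and the two differ by more than a factor of four. Here is a minimal sketch of the conversion using approximate atomic masses; the concentration is a made-up value.

```r
# Approximate atomic masses (g/mol)
mass_n   <- 14.007
mass_o   <- 15.999
mass_no3 <- mass_n + 3 * mass_o   # nitrate ion

# Hypothetical concentration reported as nitrate-nitrogen
nitrate_n_mgl <- 10

# The same sample expressed as mg/L of the whole nitrate ion
nitrate_mgl <- nitrate_n_mgl * mass_no3 / mass_n
print(round(nitrate_mgl, 1))
```

Without metadata stating which form was measured, a reader cannot tell which of the two numbers a column labelled NO3 holds.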
